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Due to the extensive use of social media and mobile devices, unbounded and 
massive data is generated continuously. The need to process this big data is 
increasing day by day. The traditional data processing algorithms fail to 
cater to the need of processing data generated by various applications such 
as digital geo-based advertising, and recommendation systems. There has 
been a high demand to process continuous spatial fuzzy textual queries over 
data stream of spatial-textual objects with high density by present location- 
based and social network-based service applications. For the spatial- 
keyword data stream, the performance plays a vital role as the geo 
information and keyword description matching is needed for every incoming 
streaming object. The various continuous geo-keyword query processing 
methods normally lack the support for fuzzy keyword matching when 
processing the objects from the geo-textual data stream. The edit distance- 
based approach with the adaptive partitioning tree index for the queries is 


used for fuzzy string matching and it outperforms than the existing 
approaches in storage cost and query performance cost. 
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1. INTRODUCTION 

Due to the increase in locationequipped devices and digital content generation, massive data is 
generated with textual and spatial information, described as spatial textual data. This data frequently come in 
streaming fashion in numerous applications viz social networks (e.g., Twitter and Facebook) and geo-enabled 
services (e.g., location-based digital advertising). It is communicated that millions of internet users are 
sending token-based data to Twitter. To comprehend spatial-textual data, it is essential to employ high- 
performance analytical methods. The main issue arises because of the processing of continuous spatial- 
textual queries instead of simple spatial-textual streams. It plays a pivotal role in many modern-day 
applications encompassing the dissemination of information [1], geo-based recommendation systems [2]. 
While processing geo-keyword queries over geo-textual continuous data, the rate of arrival of geo-textual 
data is rapid and queries are also dynamic, so the efficiency of execution of such types of queries plays a vital 
role. To address these issues, indexing for spatial-textual queries is needed which suits adaptability to both 
the keyword and spatial distributions of the submitted queries. As per the previous work in the related 
domain, two techniques are discussed [3], [4] to handle the processing of spatial-textual data with continuous 
queries on massive data. While processing geo-keyword queries, either the spatial-based filtering is 
considered followed by keyword filtering in query workload or keyword-based filtering first then followed 
by spatial filtering. In previous work, the R- tree structure is used for spatial representation of the queries, and 
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inverted lists are used for keyword storage. The solution for the same can be extended by using indexing for 
both keyword and spatial distributions in the workload of queries for higher efficiency. Chen et al. [3] the 
method of inverted indexing is used, but these are not efficient for textual filtering as the problem in 
consideration is a subset containment search in a textual outlook. Even though the inverted indexing techniques 
are established in spatial-keyword querying, they are mainly considered with superset containment queries [5], 
in which all query keywords contained within indexed objects are retrieved. Enhancement in performance is 
visible when the keywords’ ordering is handled and combinations of multiple keywords are indexed. This can 
be seen in techniques defined for queries for superset containment using the ordered keyword trie [6]. 

Galić et al. [7] presented a framework that consists of the types of data and the operations 
supporting spatial-streaming data. A spatial-temporal language for queries is discussed [8] for processing 
geo-streaming data. A distributed framework under geo-streams is developed [9], [10] for efficiently 
supervising motile data objects using distributed geo-streaming data processing in massive clusters while in 
real-time. Although all these literary works are based only on the GeoStreams’ spatial dimension, there is 
also a presence of textual information in geo streams. In force, large amounts of geo-spatial data generated 
recently also include geo-tagged micro-blogs, points of interest (POIs), images containing tags and 
geo-locations [3], [11]. As per the information in [12], approximately 30 million users submitted data with 
geo-tags on Twitter. Also, much such software like social networking sites (Facebook and Twitter) and 
regional services (regional advertising) send data in a rapid streaming pattern [12]. The hybrid indexing for 
the geo-spatial k-nearest neighbor (KNN) queries is discussed [13]. The authors used the tree approach and 
the space-filling curve concept for indexing the queries [14]. Various similarity measures to find the nearest 
objects are discussed [15]. The authors suggested a hybrid framework combining R-tree for location data and 
the inverted file for keyword purpose [16]. 

A new hybrid indexing method, adaptive geo-textual (spatial-textual) partition tree (AP-Tree), is 
coined to efficiently manage continuous and dynamic spatial-keyword queries. An AP-Tree can be defined as 
an f-ary tree in which there is a recursive division of queries based on keyword or spatial partitions (nodes). 
A cost model is also proposed to look over the assortment of division techniques so that the indexing is 
flexible with the query workload by integrating an alternative form of ordered keyword trie structure to 
improve the performance of textual filtering [12]. Nonetheless, the indexing for continuous spatial-keyword 
queries methods faces two fundamental problems in the present scenario. Firstly, these indexing approaches 
do not take into account partial keyword matching. In many applications, exact string matching does not cater 
to the need. Partial or fuzzy keyword matching is required for users who do not have a clear search condition, 
or some keyword errors (spelling errors), or when the queried data itself has some sort of unreliable 
fragments. This feature of approximate keyword matching is necessary for the processing of geospatial-based 
data, affirmed by the studies in [17]. The previous work for such type of query processing focused on mainly 
exact keyword matching. 

Recently, the spatial-keyword search has been a sight of interest, which intends to fetch the related 
spatial-textual objects for a submitted spatial-keyword query. In general, the current work is a combination of 
spatial indexing and keyword indexing methods to construct objects so that incompatible objects are efficiently 
eliminated from textual and spatial perspectives. Broadly, these methods can be categorized into two classes: 
keyword-first [18]-[21] and spatial-first [16], [22]. In continuous query processing systems, there are quite a few 
endless queries that keep on running continuously. The incoming data objects are evaluated constantly and 
assigned to the matched queries logged in the system. Multiple works on publish/subscribe models explore a lot 
of continuous queries like predicate-based matching [23]-[26] and similarity-based ranking [27], [28]. However, 
spatial data is not taken into consideration. More recently, continuous dynamic spatial keyword queries are 
under study [29], [30], however, they look into continuous addressing of appropriate data objects with dynamic 
queries, making it fundamentally incompatible with the issue. However, it is proposed a time-based 
publish/subscribe system while taking into account both textual and spatial factors and semantic-based top-k 
search is performed [31]. 

Here, it explores the most notable works encompassing indexing techniques for geo-textual data. The 
techniques of indexing of geo-textual data are broadly categorized into two approaches, i.e., for static data and 
dynamic streaming data. In the spatial data approach, all of the spatial data is being retrieved from a spatial 
database with each object defined by a set of keywords. The location information and keyword information are 
to be stored in all incoming objects. These techniques store both the spatial information and keywords of every 
data object to execute spatial-keyword queries. The data structure R-tree has been greatly used to ground geo- 
textual data in these methods. For example, in the work [32], the authors looked into a hybrid and combined 
indexing structure that manages inverted lists for the document tokens and also incorporated an extended R- 
trees data structure for geo-textual tokens. Along the same lines, Zhang et al. [33] put forward an bR*-tree, an 
extension of R*-tree with bitmaps. Also, various spatial indexing techniques have been explored for geo- 
keyword data. An inverted linear Quadtree (IL-Quad tree) [20] formed on an inverted index. The linear 
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Quadtree was proposed for top-k search based on location and keyword. The k-dimensional (KD) tree structure 
with an inverted list was proposed for the indexing of geo-textual data [34]. 

The methods used in the indexing of static spatial-textual data will not suffice to incorporate the 
dynamic geo-textual data. Hence, in recent years, different query indexing attempts, which work on the 
indexing of continuous queries for the filtering process of geo-textual data are being looked into to tackle this 
issue. These methods of indexing are roughly classified into three categories: indexing on keyword-first, 
indexing on spatial-first and adaptive indexing. The inverted file quad-tree (IQ-tree) [3] uses a Quadtree to 
classify queries such that every query is connected with one or more Quadtree grids/cells. All the queries in 
every cell are allocated to the posting list having its frequent keyword using a ranked-key-based inverted list. In 
the same way, the R‘tree i.e., R-tree with tokens [4] uses the R-tree first for indexing the queries considering 
their spatial data, and then every R-tree node stores the keywords of its child queries for appropriately filtering 
the textual data. The spatial data feature is the only determining factor during the construction of the tree 
structures in IQ-Tree [3] and R‘tree [4]. Thus, they take a hit in the overall performance as they use separate 
keyword and spatial distributions in the submitted queries. To tackle this, the authors proposed an adaptive 
textual-spatial partition tree data structure i.e., AP-tree [12] utilizes an f-ary data structure that processes the 
indexing of the queries in a flexible way considering the workload of queries. But the problem lies in the fact 
that the current indexing techniques for continuous and dynamic spatial-keyword queries do not support 
approximate or partial keyword searches. The authors put forth an MHR Tree i.e., R Tree using a min-wise 
signature with linear hashing which combines the R-tree and Min-wise signature to execute spatial keyword 
queries where the objects are searched for approximate string matching [17]. The authors used the advanced 
adaptive partitioning tree concept in which q-grams’ signatures are stored in an advanced adaptive partition tree- 
AAP tree (AP-Tree+). 

The dynamic programming approach is used to dividing keywords into different partitions [35]. The 
sub-queries are generated from continuous queries using a simulated annealing algorithm. The push and pull- 
based data dissemination approaches are used to send changes to the user as per the user’s interest [36]. A 
significant amount of temporal data is generated in network monitoring, and stock exchange. Which can be used 
in online decision making as dynamic data items are generated [37]. A presorted-nearest index tree algorithm is 
introduced for nearest neighbor queries on mobile objects within desired period and outperformed in saving 
time compared to KD-tree approach [38]. Vantage point tree indexing with spectral clustering is used for high 
dimensional data for effective data retrieval for user query [39]. The authors discussed various real time 
classification and clustering techniques for data stream as well as platforms for mining of data streams [40]. A 
new framework for location-based services under the umbrella of mobile cloud computing providing the 
location privacy and integrity of service results is discussed [41]. 

As the incoming query data is massive in number, the need to come up with effective indexing 
techniques so that quite a few incompatible queries can be refined at a nominal cost has become vital. In this 
paper, for the processing of spatial keyword temporal region queries, a Levenshtein distance adaptive 
partition tree (LDAP tree) is proposed. In the LDAP tree, the keywords are stored for approximate string 
matching. A locally optimal method is used for keyword partitioning and the Levenshtein edit distance 
concept is used for approximate string matching. The paper emphasizes the challenges of approximate 
keyword searching and providing an efficient solution for streaming data. 


2. PROPOSED METHOD 

For the processing of spatial keyword temporal region queries, a LDAP tree is proposed containing 
keyword nodes, spatial nodes and query nodes. A locally optimal greedy method is used for keyword 
partitioning on keyword node. In LDAP tree, the keywords are stored for approximate string matching. 


2.1. Problem formulation 
The basics of spatial keyword temporal data stream and spatial keyword temporal queries are provided. 

- Definition 1: Spatial keyword temporal object is defined as O = (id, t, loc, kws) where t is a timestamp 
and loc is a location of the object. The set of keywords associated with the object is denoted by kws. 

- Definition 2: Spatial keyword temporal data stream is defined as S = {O;| i € [1, + œ] A Oj .t < Oust}. 
It is an unbounded set of spatial-keyword objects in increasing timestamp order. 

- Definition 3: The continuous spatial-keyword temporal query is denoted as q = (id, tı, t2, r, kw) where tı 
and tz are the timestamps during which the query exists and r is the rectangular region of the query. The 
kw is a Set of distinct keywords associated with the query. A Spatial-keyword temporal object O in stream 
S matches the query q if the three conditions, as stated in (1), are fully satisfied. 


o.loc E q.randq.t, < 0.t < q. t2 and q. kw © o. kws and sim(q. kw, 0. kws) = sthr (1) 
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In consequent sections, continuous spatial-keyword-temporal query and spatial-textual-temporal object 
are abbreviated as query and object respectively. It is assumed that the keywords in the queries and objects are 
in sorted order such that Wi<W; if i<j where Wj and Wj are keywords in queries and objects. 

For example: There are four spatial-keyword queries {q1, q2, q3, q4} and two objects {O1, O2}. The 
keyword set for O; is {Wi, W3, We} and O2 is {Wi, W3, Ws}. So, the object O; satisfy the spatial and keyword 
condition for qı as O; contain all the keywords of qi. The object Oz satisfy the spatial and keyword condition for 
qaas shown in Figure 1. 


a n 


Figure 1. The example for set of spatial-keyword queries 


3. METHOD 
3.1. Continuous query processing using LDAP tree 

The architecture of the system is as shown in Figure 2 is: i) Continuous Query Submission: The 
continuous queries are given by the user as per the definition stated in section 2.1. ii) Construction of Index 
for Query/Update: The hybrid indexing approach LDAP Tree is used for indexing the queries. For dividing 
the queries into different subgroups, the local optimal selection based greedy approach is used. The 
construction and updating of the index will be performed. iii) Acquisition of data stream containing objects 
as stated in section 2.1. vi) The objects will be mapped to the index for searching relevant queries using 3 
predicates as stated in section 2.1. It will check for locality constraint, temporal constraint, and matching of 
the keyword for approximations rather than exact. The edit distance (Levenshtein distance) based approach is 
used for fuzzy matching of keywords. v) The relevant queries for the object will be returned. 


Spatial-keyword 
temporal queries 


Construct the hybrid index for the spatial- 
keyword temporal queries 


Map the objects in data stream to 
the query index 


Data 
Stream 


= 
= 


Check for the objects for spatial, keyword 
matching and temporal predicates 


Set of matched queries 


Figure 2. Architecture for evaluation of spatial-textual region queries 


3.2. Spatial and keyword filtering framework 

For indexing spatial-keyword-temporal queries and searching the object from the unbounded data 
stream, Levenshtein distance-based adaptive spatial-keyword filtering approach (LDAP) is used as shown in 
Figure 3. The two-way partitioning is used for spatial filtering and keyword filtering. The queries are 
recursively partitioned in a top-down manner using spatial division and keyword division. LDAP tree 
comprises of three types of nodes: spatial node, keyword node, and query node. A node in a tree can be a 
keyword node or spatial node depending on the method of partitioning used. The term cut c is used to define 
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the degree of the intermediate node. The queries will be stored at the leaf node of the tree. Each query can be 

present in one or more query leaf nodes as per its set of keywords and the geo region. The construction of an 

index for spatial-keyword-temporal queries can be explained in detail by considering spatial partitioning and 

keyword partitioning in detail [12]. 

= Spatial node and its partitioning: The spatial region are recursively subdivided into smaller c spatial 
nodes. The region of the spatial node N is denoted by Nr which is subdivided again into c spatial grid 
nodes. A query q on that spatial node N is assigned to cut or cell C if query region q.r contains C or 
overlaps C. The query can be allocated to multiple spatial child nodes. 

- Keyword node and its partitioning: Assuming the ordering of keywords in the vocabulary (set of all 
distinct keywords), in each query. For the generation of the cuts c for the given keyword node, the 
queries allocated to that node are divided into c ordered cuts according to the index of that keyword in 
the query and that index will decide the partition offset of that keyword node N. Each cut or bucket B 
may contain the set of ordered keywords C[Ki, Kj] where Ki and Kj are keywords for left and right 
boundaries in the given cut. For dividing the keyword node into different cuts, a cost model is used. The 
cost of partition P is computed as [12]: 


C(P) = Xf- w(Bi) * p(Bi) (2) 
where C: number of cuts and w(Bi): number of queries associated to Bi. 


p(B) = Zwes p(w) (3) 


_ freq(w) 
POW) = 5 ep frea) 4) 
p(B) = area (B)/area (N) (5) 


- If q contains keywords {W,, W2..., Wn}, the first m keywords (where 1<m<n) in {W1, W2..., Wn} 
will be explored in previous keyword nodes only if there are m number of keyword nodes in the q- 
node’s ancestor nodes including parent node. So, only last (n-m) keywords for each query will be 
stored. Where KW and SP denote keyword node and spatial node respectively. 

As discussed in Algorithm 1, the greedy approach is used for the keyword partition of query 
index where all the keywords are divided into c partitions/cuts. Algorithm 2 describes the procedure of 
building an index for query dataset Q where where keyword and spatial nodes is generated using cost 
model. The dummy cut/cell is used for the queries if the number of keywords in a query is less than the 
offset of the partition or the region of the query contains the region of the spatial node. The object 
searching is done using Algorithm 3 on query index where the searching is done from the root node and 
every keyword in the incoming object is searched in each partition of keyword node. In case of spatial 
node, the location of the object is checked in the cells of spatial node. 


Figure 3. LDAP tree for query indexing 
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Algorithm 1: GreedyMethod_for_keyword Partition (KW, c) 

Input: KW: Set of keywords, c: fanout of node i.e., cuts 

Output: Pe: Partition with keywords with c cuts 

1) find an initial solution € partition Pc which partitions KW into c cuts with similar 
weights; 

2) for (2<=j<=c) do 

3) for every keyword u lies between left (Cj-1) and right (cj) do 
4) calculate C(Pc) // using cost model 

5) if (lower C(Pc) is achieved) 

6) Update cuts Cj-ı and cj in Pc 

7) return Pc 


Algorithm 2: Index_construct (CN, Q, /, Pk, Ps) 

Input Q: Query dataset, CN: current node, 1: offset for partitioning keywords 
Ps and Pk: flags for spatial and keyword partitions respectively, thr: threshold 
value for storing queries in query node 

Output: LDAP-Tree 

1 if (Pk is not true and Ps is not true) or |Q| < thr, then 

2 CN is a query-node for set of queries Q 

3 return 

4 k-cost = +% s-cost = + œ% 

5 if (Pk is true) then 

6 k-cost © compute cost for partition on keywords on Q having offset 1 

7 if (Ps is true ) then 

8 s-cost © compute cost for partition on spatial on Q 

9 if (Pk) is selected (i.e., k-cost < s-cost) then 
CN is a keyword-node having offset Nı = 1 
create dummy cut for Queries Qt where |q.kw| < 1 
for every child node B (i.e., cut) of node CN do 

Qs = Q - Q? 
Index _construct(B, Qs., 1 + 1, Pk, Ps); 
if CN is a spatial-node 
Qt = queries from Q contains Nr 
create dummy cut for Queries Q¢9 
for every child node B (cell) of node CN do 
Qa = Q - Qa 
index _construct(B, Qs, 1, Pk, Ps); 


POW OWMOIA SF WNHER CO 


NN 


Algorithm 3: ObjectSearch (0, s, CN) 

Input: o: incoming spatial-keyword object, s: start position wrt o.kws, CN: current accessed node 
Output: A: Result set containing the matched and relevant queries for an object 
1 if CN is a query-node then 

2 check_queries (N) 

3 update A 

3 return 

4 if CN is a keyword-node then 

5 for s <= j <= o.kws do 

6 find appropriate cut based on Kj in (o.kws) 

7 if any cut is not explored then 


8 ObjectSearch(o, j + 1, cut) 

9 if exist (dummy_cell) then 

10 ObjectSearch(0, s, dummy cell) 

dle else 

12 find the grid that covers o.loc using cell structure; 
13 ObjectSearch(0, s, cell) 

14 if dummy_cut is present then 

15 ObjectSearch(0, s, dummy cut) 


In the algorithm check _queries(), the following conditions are checked. 

1) The superset containment of keywords in queries to keywords in the object. 

2) The object location lies in the region of query and time constraint is checked. 

3) The approximate matching of the keyword in query and the keyword object is done using 
Levenshtein distance method at query node as well as keyword node for every query keyword. 


4. RESULTS AND DISCUSSION 
The results of experimental evaluation are presented. Experimental Setup: The geo-keyword-based 
services have been frequently used in numerous applications e.g., social media, and digital advertising. 
To evaluate the discussed indexing approach, the following datasets are used. 
i) AIS dataset contains the geo-locations which are taken from Chorochronos Archive 
(http://www.chorochronos.org) [35]. 
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ii) Keywords are taken from Newsgroups with about 61,000 keywords 
(http://people.csail.mit.eduljrennie120Newsgroups) [35]. Object dataset is generated using the spatial 
locations and keywords from vocabulary (Newsgroups): 500,000 with 1-9 keywords. 

iii) To generate query workload: 100,000 spatial-keyword objects from the dataset are selected. From every 
selected object, the keywords for queries are selected randomly from the selected objects varying it between 
1 and 5. The query region of the query is taken as rectangle with center as the spatial location of the object. 
The region size is selected as 0.001. The experiments are run on 17-7500U @2.70 GHz with 8 GB RAM. 

iv) Parameters of Experiments: Query dataset is of size 10000, thr (the maximum no. of queries in a query 
node) =40, no. of partitions c=200, sthr=0.5. 

The results i.e., query execution time and storage cost for query index is compared with the methods 
used in [35] as shown in Figure 4 and Figure 5. To evaluate the performance of the queries on data stream, 
which is continuous, where the queries are also dynamic, the two measures are considered i.e., query 
execution cost and storage cost for query indexing. As the keywords are stored instead of q-gram signatures 
in the indexing data structure for the queries, it reduces the storage cost of the index significantly. During the 
object searching, the fuzzy keyword matching is done based on the edit distance (Levenshtein distance) 
method as the keywords are short text descriptions. It results in faster execution time. 


Query Execution Time 


gue AAP-Tree 


—@=— LDAP-Tree 


Execution Time ( Sec.) 


o T T T T 1 
100K 200K 300K 400K 500K 
Dataset Size 


Figure 4. Comparison of query execution time 


Memory for Indexing 


—¢— AAP-Tree 
=.= LDAP-Tree 


Index Memory (KB) 


10K 20K 30K 40K 50K 


Query dataset size 


Figure 5. Storage cost for indexing query dataset 


5. CONCLUSION 

It is necessary to cater to the increasing need for processing spatial-keyword-temporal data 
efficiently which is generated continuously due to the extensive use of social media and web search. The 
index for finite queries was generated for efficient execution of spatial-keyword queries which are less 
dynamic than the streaming data. The approximate string matching suited to these applications rather than 
exact string matching. The edit distance-based approach was used for fuzzy string matching which is suitable 
for short text descriptions with typo errors. The LDAP tree outperformed in storage costs for queries and 
execution cost for object searching than the previous methods. The approach can be extended to multiple 
streams and distributed query processing. 
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